Abstract 

In this paper, we tackle the problem of prediction and confidence intervals for time series 
using a statistical learning approach and quantile loss functions. First, we show 
that the Gibbs estimator (also known as the Exponentially Weighted Aggregate) is able to 
predict as well as the best predictor in a given family for a wide set of loss functions. 
In particular, using the quantile loss function of Koenker and Bassett (1978), this allows 
us to build confidence intervals. We apply these results to the problem of prediction and 
confidence regions for the French Gross Domestic Product (GDP) growth, with promising 
results. 

Keywords: Statistical learning theory, time series prediction, quantile regression, GDP 
forecasting, PAC-Bayesian bounds, oracle inequalities, weak dependence, confidence inter- 
vals, business surveys. 



1. Introduction 

Motivated by economic problems, the prediction of time series is one of the most emblematic 
problems of statistics. Various methodologies are used, coming from fields as diverse as 
parametric statistics, statistical learning, computer science or game theory. 

In the parametric approach, one assumes that the time series is generated according to 
a parametric model, like ARMA or ARIMA processes, see e.g. Hamilton (1994); Brockwell 
and Davis (2009). Such an assumption is unrealistic in many applications. However, under 
this assumption, it is possible to estimate the parameters of the model, and to build 
confidence intervals on the prediction. 

From the statistical learning point of view, one usually tries to avoid such restrictive 
parametric assumptions - see, e.g., Cesa-Bianchi and Lugosi (2006); Stoltz (2010) for the online 
approach dedicated to the prediction of individual sequences, and Modha and Masry (1998); 
Meir (2000); Alquier and Wintenberger (2012) for the batch approach. However, in this 
setting, little attention has been paid to the construction of confidence intervals or to 
any quantification of the precision of the prediction. This is a major drawback in many 
applications. 

In Biau and Patra (2011), a method was proposed for the online approach: the idea 
is to minimize the cumulated risk corresponding to the quantile loss function defined by 
Koenker and Bassett (1978). Some asymptotic results are provided. 

In this paper, we propose to adapt this approach to the batch setting and provide 
nonasymptotic results. We also apply these results to build quarterly predictions and 
confidence regions for the French Gross Domestic Product (GDP) growth. Our approach is the 
following. We assume that we are given a set of basic predictors - this is a usual approach 
in statistical learning, where the predictors are sometimes referred to as "experts", e.g. 
Cesa-Bianchi and Lugosi (2006). Following Alquier and Wintenberger (2012), we describe an 
aggregation procedure, usually referred to as the Exponentially Weighted Aggregate (EWA), 
Dalalyan and Tsybakov (2008); Gerchinovitz (2011), or Gibbs estimator, Catoni (2004, 2007). 
It is interesting to note that this procedure is also related to aggregation procedures in 
online learning such as the weighted majority algorithm of Littlestone and Warmuth (1994), 
see also Vovk (1990). We give a PAC-Bayesian inequality that ensures optimality properties 
for this procedure. In a few words, this inequality claims that our predictor performs as 
well as the best basic predictor up to a remainder of the order $\mathcal{K}/\sqrt{n}$, where 
$n$ is the number of observations and $\mathcal{K}$ measures the complexity of the set of basic 
predictors. This result is very general; only two conditions are required: the time series 
must be weakly dependent in a sense that we will make more precise in Section 4, and the loss 
function must be Lipschitz. This includes, in particular, the quantile loss functions. This 
allows us to apply this result to our problem of economic forecasting. 

The paper is organized as follows. Section 2 provides the notation used throughout the 
paper. Then, we describe the Gibbs estimator in Section 3. The PAC-Bayesian 
inequality, Theorem 4.1, is given in Section 4, and the application to quantile losses and GDP 
forecasting in Section 5. Finally, the proof of Theorem 4.1 is postponed to the appendix. 

2. The context 

Let us assume that we observe $X_1, \dots, X_n$ from an $\mathbb{R}^p$-valued stationary time series 
$X = (X_t)_{t\in\mathbb{Z}}$ defined on $(\Omega, \mathcal{A}, \mathbb{P})$. From now on, $\|\cdot\|$ will denote the 
Euclidean norm on $\mathbb{R}^p$. Fix an integer $k$ and let us assume that we are given a family of 
predictors $\{f_\theta : (\mathbb{R}^p)^k \to \mathbb{R}^p, \theta \in \Theta\}$: for any $\theta$ and any $t$, $f_\theta$ applied to the 
last past values $(X_{t-1}, \dots, X_{t-k})$ is a possible prediction of $X_t$. For the sake of 
simplicity, let us put, for any $t \in \mathbb{Z}$ and any $\theta \in \Theta$, 

$$\hat{X}_t^\theta = f_\theta(X_{t-1}, \dots, X_{t-k}).$$

We also assume that $\theta \mapsto f_\theta$ is linear. Note that we may want to include parametric models 
as well as non-parametric prediction. In order to deal with various families of predictors, we 
propose a model-selection type approach: 

$$\Theta = \bigcup_{j=1}^{m} \Theta_j.$$

Example 2.1 A first example is the linear auto-regressive class of predictors. We can take 
$\theta = (\theta_0, \theta_1, \dots, \theta_k) \in \Theta = \mathbb{R}^{k+1}$ and 

$$f_\theta(X_{t-1}, \dots, X_{t-k}) = \theta_0 + \sum_{j=1}^{k} \theta_j X_{t-j}.$$

In this case we deal with only one model, $m = 1$ and $\Theta = \Theta_1$. 

Example 2.2 We may generalize the previous example to non-parametric auto-regression, 
for example using a dictionary of functions $(\mathbb{R}^p)^k \to \mathbb{R}^p$, say $(\varphi_i)_{i=1}^{\infty}$. Then we can fix 
$m = n$, and take $\theta = (\theta_1, \dots, \theta_j) \in \Theta_j = \mathbb{R}^j$ and 

$$f_\theta(X_{t-1}, \dots, X_{t-k}) = \sum_{i=1}^{j} \theta_i \varphi_i(X_{t-1}, \dots, X_{t-k}).$$

Finally, we have to define a quantitative criterion to evaluate the quality of the predictions. 
Let $\ell$ be a loss function. More precisely, we will assume that $\ell$ satisfies the following 
assumption. 

Assumption LipLoss: $\ell$ is given by $\ell(x,x') = g(x - x')$ for some convex function $g$ 
satisfying $g \geq 0$, $g(0) = 0$ and $g$ is $K$-Lipschitz. 

Definition 2.1 We put, for any $\theta \in \Theta$, 

$$R(\theta) = \mathbb{E}\left[\ell\left(\hat{X}_t^\theta, X_t\right)\right].$$

Note that because of the stationarity, $R(\theta)$ does not depend on $t$. 

Example 2.3 A first example is $\ell(x,x') = \|x - x'\|$. In this case, the Lipschitz constant $K$ 
is 1. This example was studied in detail in Alquier and Wintenberger (2012). In Modha 
and Masry (1998); Meir (2000), the loss function is the quadratic loss $\ell(x,x') = \|x - x'\|^2$. 
Note that it also satisfies our Lipschitz condition, but only if we assume that the time series 
is bounded. 

Example 2.4 When the time series is real-valued, we can use a quantile loss function. The 
class of quantile loss functions is defined as 

$$\ell_\tau(x,y) = \begin{cases} \tau\,(x - y) & \text{if } x - y > 0, \\ -(1-\tau)\,(x - y) & \text{otherwise,} \end{cases}$$

where $\tau \in (0,1)$. It is motivated by the following remark: if $U$ is a real-valued random 
variable, then any value $t^*$ satisfying $\mathbb{P}(U \leq t^*) = \tau$ is a minimizer of 
$t \mapsto \mathbb{E}[\ell_\tau(U, t)]$; such a value is called a quantile of order $\tau$ of $U$. This loss function 
was introduced by Koenker and Bassett (1978) for "quantile regression"; since then it has 
become a classical tool in statistics, see e.g. Koenker (2005) for a survey. Recently, Belloni 
and Chernozhukov (2011) used it in the context of high-dimensional regression with the LASSO, 
and Biau and Patra (2011) used it to build non-parametric confidence intervals on time series. 
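
As a small illustration of Example 2.4, the following sketch evaluates the quantile loss and checks numerically that a quantile of order $\tau$ of a sample minimizes the average loss. The function names `quantile_loss` and `average_loss`, the toy sample and the search grid are our own choices for illustration, not part of the original method.

```python
def quantile_loss(x, y, tau):
    """Quantile loss l_tau(x, y) of Example 2.4, written with u = x - y."""
    u = x - y
    return tau * u if u > 0 else -(1.0 - tau) * u


def average_loss(sample, t, tau):
    """Empirical counterpart of E[l_tau(U, t)] over a finite sample."""
    return sum(quantile_loss(u, t, tau) for u in sample) / len(sample)


def grid_minimizer(sample, tau, grid):
    """Candidate of the grid with the smallest average quantile loss."""
    return min(grid, key=lambda t: average_loss(sample, t, tau))


if __name__ == "__main__":
    sample = [float(i) for i in range(1, 11)]  # toy data: 1, 2, ..., 10
    grid = [i / 100.0 for i in range(0, 1101)]  # candidates 0.00, ..., 11.00
    # With tau = 0.3, three points out of ten lie below any minimizer,
    # so the printed value falls between 3 and 4.
    print(grid_minimizer(sample, 0.3, grid))
```

On this toy sample the average loss is flat on the whole interval between the third and fourth observations, which reflects the fact that an empirical quantile is not always unique.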

3. Gibbs estimator 

We introduce in this section the Gibbs estimator. As already mentioned in the introduction, 
such aggregated estimators were used in learning theory under names such as weighted 
majority aggregate or EWA. 

Definition 3.1 We define, for any $\theta \in \Theta$, the empirical risk 

$$r_n(\theta) = \frac{1}{n-k} \sum_{i=k+1}^{n} \ell\left(\hat{X}_i^\theta, X_i\right).$$

Let $\mathcal{T}$ be a $\sigma$-algebra on $\Theta$ and $\mathcal{T}_\ell$ be its restriction to $\Theta_\ell$ for any $\ell \in \{1, \dots, m\}$. 
Let $\mathcal{M}_+^1(\Theta)$ denote the set of all probability measures on $(\Theta, \mathcal{T})$. Let $\pi \in \mathcal{M}_+^1(\Theta)$. This 
probability measure is usually called the prior by analogy with Bayesian statistics. Actually, 
it will be used as a tool to control the complexity of the set of predictors $\Theta$. 
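
To make Definition 3.1 concrete, here is a minimal sketch of the empirical risk for the linear auto-regressive predictors of Example 2.1, assuming a real-valued series ($p = 1$) stored in a Python list. The names `ar_predict`, `empirical_risk` and `absolute_loss` are hypothetical helpers introduced for this illustration only.

```python
def ar_predict(theta, past):
    """Linear auto-regressive predictor of Example 2.1.

    theta = (theta_0, theta_1, ..., theta_k) and
    past = (x_{t-1}, ..., x_{t-k}), most recent value first.
    """
    return theta[0] + sum(c * x for c, x in zip(theta[1:], past))


def empirical_risk(theta, series, loss):
    """r_n(theta): average loss of the predictions of X_{k+1}, ..., X_n."""
    k = len(theta) - 1
    n = len(series)
    total = 0.0
    for i in range(k, n):  # 0-based index i is time i + 1
        past = [series[i - j] for j in range(1, k + 1)]
        total += loss(ar_predict(theta, past), series[i])
    return total / (n - k)


def absolute_loss(prediction, observation):
    """The loss of Example 2.3 in dimension one."""
    return abs(prediction - observation)


if __name__ == "__main__":
    series = [1.0, 2.0, 3.0, 4.0, 5.0]
    # theta = (1, 1) predicts x_{t-1} + 1, which is exact on this series.
    print(empirical_risk((1.0, 1.0), series, absolute_loss))
    # theta = (0, 1) predicts x_{t-1}, which is always off by one here.
    print(empirical_risk((0.0, 1.0), series, absolute_loss))
```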

Remark 3.1 In the case where $\Theta = \bigcup_j \Theta_j$ and the $\Theta_j$ are disjoint, we can write 

$$\pi(d\theta) = \sum_{j=1}^{m} \mu_j \pi_j(d\theta)$$

where $\mu_j := \pi(\Theta_j)$ and $\pi_j(d\theta) := \pi(d\theta)\mathbf{1}_{\Theta_j}(\theta)/\mu_j$. Note that $\pi_j$ can be interpreted as a 
prior probability measure inside the model $\Theta_j$ and that the weights $\mu_j$ can be interpreted as 
an a priori probability measure between the models. 

Definition 3.2 We put, for any $\lambda > 0$, 

$$\hat{\theta}_\lambda = \int_\Theta \theta\, \hat{\rho}_\lambda(d\theta)$$

where 

$$\hat{\rho}_\lambda(d\theta) = \frac{e^{-\lambda r_n(\theta)}\, \pi(d\theta)}{\int_\Theta e^{-\lambda r_n(\theta')}\, \pi(d\theta')}.$$

Remark 3.2 Note that, analogously to the Bayesian estimator, the Gibbs estimator is written 
as an integral on the parameter space. It can thus be computed by Monte Carlo methods, 
see Robert (1996); Marin and Robert (2007). This is the approach that we will use in this 
paper. 

Remark 3.3 The choice of the parameter $\lambda$ is discussed in the next section. 
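
The integral of Definition 3.2 can be approximated in the simplest possible way by drawing parameters from the prior and reweighting them. The sketch below assumes that a sample from the prior is already available as a list of vectors; the function name `gibbs_estimator` is ours, and this naive scheme is only an illustration (Section 5.3 describes the importance sampling scheme actually used).

```python
import math


def gibbs_estimator(prior_sample, risk, lam):
    """Approximate the Gibbs estimator from draws of the prior.

    Each draw theta_i receives the weight exp(-lam * r_n(theta_i)); the
    estimator is the weighted mean of the draws.
    """
    risks = [risk(theta) for theta in prior_sample]
    # Subtracting the smallest risk avoids numerical underflow of every
    # weight; the common factor cancels in the ratio.
    r_min = min(risks)
    weights = [math.exp(-lam * (r - r_min)) for r in risks]
    total = sum(weights)
    dim = len(prior_sample[0])
    return [
        sum(w * theta[d] for w, theta in zip(weights, prior_sample)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    draws = [[0.0], [1.0], [2.0], [3.0]]       # toy "prior sample"
    toy_risk = lambda theta: abs(theta[0] - 2.0)  # smallest at theta = 2
    print(gibbs_estimator(draws, toy_risk, 0.0))     # plain mean of the draws
    print(gibbs_estimator(draws, toy_risk, 1000.0))  # concentrates near 2
```

With $\lambda = 0$ the estimator is the prior mean, and as $\lambda$ grows it concentrates on the draws with the smallest empirical risk.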

4. Theoretical results 

In this section, we provide a PAC-Bayesian oracle inequality for the Gibbs estimator. 
PAC-Bayesian bounds were introduced in the context of supervised classification (using the 
0/1-loss), see the seminal papers Shawe-Taylor and Williamson (1997); McAllester (1999). More 
general versions can be found in Catoni (2004, 2007). These results were generalized to different 
contexts and loss functions, see Alquier (2008) for a presentation with a general loss function. 
See also Audibert (2010) for a nice survey of the more recent advances. The idea is that 
the risk of the Gibbs estimator will be close to $\inf_\theta R(\theta)$ up to a small remainder. More 
precisely, we upper-bound it by 

$$\inf_{\rho} \left\{ \int R(\theta)\rho(d\theta) + \mathrm{remainder}(\rho, \pi) \right\}$$

where the inf is taken over all the probability distributions on $\Theta$. 

In order to be able to control the prediction risk of our estimator $\hat{\theta}_\lambda$, $R(\hat{\theta}_\lambda)$, we will need 
some hypotheses. The first hypothesis concerns the dependence of the process; it uses the 
$\theta_{\infty,n}(1)$-coefficients of Dedecker et al. (2007). Such a condition is also used in Alquier and 
Wintenberger (2012), and is more general than the mixing conditions used in Meir (2000); 
Modha and Masry (1998). 

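Assumption WeakDep: we assume that the distribution $\mathbb{P}$ is such that the stationary 
process $(X_t)_{t\in\mathbb{Z}}$ is bounded, i.e. a.s. $\|X_0\|_\infty \leq \mathcal{B} < \infty$, and such that there is a constant $C$ 
with $\theta_{\infty,k}(1) \leq C < \infty$ for any $k$. We recall that for any $\sigma$-algebra $\mathfrak{S} \subset \mathcal{A}$, for any $q \in \mathbb{N}$, 
for any $(\mathbb{R}^p)^q$-valued random variable $Z$ defined on $(\Omega, \mathcal{A}, \mathbb{P})$, we put 

$$\theta_\infty(\mathfrak{S}, Z) = \sup_{f \in \Lambda_1^q} \Big\| \mathbb{E}[f(Z)|\mathfrak{S}] - \mathbb{E}[f(Z)] \Big\|_\infty$$

where 

$$\Lambda_1^q = \left\{ f : (\mathbb{R}^p)^q \to \mathbb{R}, \quad \frac{|f(z_1, \dots, z_q) - f(z_1', \dots, z_q')|}{\sum_{j=1}^{q} \|z_j - z_j'\|} \leq 1 \right\}$$

and that 

$$\theta_{\infty,k}(1) := \sup \left\{ \theta_\infty\big(\sigma(X_t, t \leq p), (X_{j_1}, \dots, X_{j_\ell})\big), \quad p < j_1 < \dots < j_\ell, \quad 1 \leq \ell \leq k \right\}.$$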

Remark 4.1 Some examples of processes satisfying WeakDep are provided, for example, 
in Alquier and Wintenberger (2012). It includes the large family of bounded causal Bernoulli 
shifts, that is, bounded processes of the form 

$$X_t = H(\xi_t, \xi_{t-1}, \xi_{t-2}, \dots)$$

where the "innovations" $\xi_t$ are iid and bounded and $H$ satisfies a Lipschitz-type condition. In 
particular, this includes ARMA processes with bounded innovations. It also includes uniform 
$\varphi$-mixing processes, defined e.g. in Doukhan (1994); Rio (2000a), and some dynamical 
systems. 

Assumption Lip: for any $\theta \in \Theta$ we assume that there are coefficients $a_j(\theta)$ for $1 \leq j \leq k$ 
satisfying, for any $x_1, \dots, x_k$ and $y_1, \dots, y_k$, the relation 

$$\|f_\theta(x_1, \dots, x_k) - f_\theta(y_1, \dots, y_k)\| \leq \sum_{j=1}^{k} a_j(\theta) \|x_j - y_j\|.$$

We define $L := \sup_{\theta\in\Theta} \sum_{j=1}^{k} a_j(\theta)$ and assume that this value is finite. 
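
Theorem 4.1 (PAC-Bayesian Oracle Inequality) Let us assume that Assumptions 
LipLoss, WeakDep and Lip are satisfied. Then, for any $\lambda > 0$, for any $\varepsilon > 0$, 

$$\mathbb{P}\left\{ R(\hat{\theta}_\lambda) \leq \inf_{\rho \in \mathcal{M}_+^1(\Theta)} \left[ \int R\, d\rho + \frac{2\lambda\kappa^2}{n\left(1 - \frac{k}{n}\right)^2} + \frac{2\mathcal{K}(\rho,\pi) + 2\log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right] \right\} \geq 1 - \varepsilon$$

where $\kappa = \kappa(K, L, \mathcal{B}, C) := K(1 + L)(\mathcal{B} + C)/\sqrt{2}$ and where we recall that $\mathcal{K}(\rho,\pi)$ is the 
Kullback divergence between $\rho$ and $\pi$, defined by 

$$\mathcal{K}(\rho,\pi) = \begin{cases} \int \log\left[\frac{d\rho}{d\pi}(\theta)\right] \rho(d\theta) & \text{if } \rho \ll \pi, \\ +\infty & \text{otherwise.} \end{cases}$$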



Remark 4.2 The choice of $\lambda$ in practice may be a problem. In Catoni (2003, 2007) a 
general method is proposed to optimize the bound with respect to $\lambda$. However, while adapted 
to the iid case, this method is more difficult to use in the context of time series as it would 
require the knowledge of $\kappa$, and so the knowledge of $\theta_{\infty,n}(1)$ - or at least the knowledge of 
an explicit upper bound for $\theta_{\infty,n}(1)$. In practice, however, some empirical calibration seems 
to give good results, as shown in Section 5. 
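
The trade-off in $\lambda$ can be read directly from the remainder of Theorem 4.1, which has the form $a\lambda + b/\lambda$. The sketch below evaluates this remainder for a fixed $\rho$ and computes the value of $\lambda$ that minimizes it; it assumes that $\kappa$ and $\mathcal{K}(\rho,\pi)$ are known numbers, which is precisely what fails in practice. The names `remainder` and `best_lambda` are ours.

```python
import math


def remainder(lam, n, k, kappa, kl, eps):
    """Remainder of Theorem 4.1 for a fixed rho with K(rho, pi) = kl."""
    first = 2.0 * lam * kappa ** 2 / (n * (1.0 - k / n) ** 2)
    second = (2.0 * kl + 2.0 * math.log(2.0 / eps)) / lam
    return first + second


def best_lambda(n, k, kappa, kl, eps):
    """Minimizer of a * lam + b / lam, namely sqrt(b / a)."""
    a = 2.0 * kappa ** 2 / (n * (1.0 - k / n) ** 2)
    b = 2.0 * kl + 2.0 * math.log(2.0 / eps)
    return math.sqrt(b / a)


if __name__ == "__main__":
    # Purely illustrative constants (kappa is unknown in applications).
    n, k, kappa, kl, eps = 400, 2, 1.5, 3.0, 0.05
    lam = best_lambda(n, k, kappa, kl, eps)
    print(lam, remainder(lam, n, k, kappa, kl, eps))
```

Since the optimal $\lambda$ grows like $\sqrt{n}/\kappa$, the minimized remainder decreases like $1/\sqrt{n}$, which is the rate announced in the introduction.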



Remark 4.3 We want to mention that, at the price of a much more technical analysis, 
this result can be extended to the case where the $X_t$ are not assumed to be bounded. In 
the iid case, it is possible to obtain results under the existence of moments of order 4 only, 
see Audibert and Catoni; Catoni. In the context of time series, the results in Alquier and 
Wintenberger (2012) require subGaussian tails for $X_t$, but suffer a $\log(n)$ loss in the learning 
rate. 



5. Application to French GDP and quantile prediction 

We now provide in this section an application to data published by the INSEE (Institut National 
de la Statistique et des Etudes Economiques, the French national bureau of statistics). 

5.1. Uncertainty in GDP forecasting 

Every quarter $t$, economic forecasters at INSEE are asked for a prediction of the quarterly 
growth rate of the French Gross Domestic Product (GDP). Since it involves a lot of information, 
the "true value" of the growth rate $\log(\mathrm{GDP}_t/\mathrm{GDP}_{t-1})$ is only known after two years, 
but flash estimates of the growth rate, say $\Delta\mathrm{GDP}_t$, are published 45 days after the end of 
the current quarter $t$. Among the most relevant economic information available at time $t$ to 
the forecaster, apart from past GDP observations, are business surveys. Indeed, they are a 
rich source of information, for at least two reasons. First, they are rapidly available, on a 
monthly basis. Moreover, they provide information coming directly from the true economic 
decision makers. 

A business survey is traditionally a fixed questionnaire of ten questions sent monthly 
to a panel of companies. This process is described in Devilliers (1984). INSEE publishes a 
composite indicator called the French business climate indicator: it summarises the information 
of the whole survey. This indicator is defined in Clavel and Minodier (2009), see also Dubois 
and Michaux (2006). All these values are available from the INSEE website 

http://www.insee.fr/ 

Note that a quite similar approach is used in other countries, see also Biau et al. (2008) for 
a prediction of the European Union GDP based on EUROSTAT data (EUROSTAT is the 
EU bureau of statistics). 

It is however well known among economic forecasters that confidence intervals or density 
forecasts are to be given with the prediction, in order to provide an idea of the uncertainty 
of the prediction. The ASA and the NBER started using density forecasts in 1968, see 
Diebold et al. (1997); Tay and Wallis (2000) for historical surveys on density forecasting. 
The Central Bank of England and INSEE, among others, provide their prediction with a 
"fan chart", Britton et al. (1998). However, it is interesting to note that the methodology 
used is often very crude, see the criticism in Cornec (2010); Dowd (2004). For example, until 
2012, the fan chart provided by the INSEE led to the construction of confidence intervals 
with constant length. But there is empirical evidence that it is more difficult to forecast 
economic quantities during a crisis (e.g. the subprime crisis in 2008). The Central Bank 
of England fan chart is not reproducible as it includes subjective information. Recently, 
Cornec (2010) proposed a clever density forecasting method based on quantile regressions 
that gives satisfying results in practice. However, to our knowledge, this method has not 
received any theoretical support. 

Here, we use the Gibbs estimator described in the previous sections to build a forecast 
of $\Delta\mathrm{GDP}_t$, using the quantile loss function. This allows us to return a prediction, the 
forecasted median (for $\tau = 0.5$), that is theoretically supported. This also allows us to provide 
various confidence intervals corresponding to various quantiles. 

5.2. Application of Theorem 4.1 

At each quarter $t$, the objective is to predict the flash estimate of GDP growth, $\Delta\mathrm{GDP}_t$. 
As described previously, the available information is $\Delta\mathrm{GDP}_{t'}$ for $t' < t$ and $I_{t'}$ for $t' < t$, 
where for notational convenience, $I_{t-1}$ is the climate indicator available to the INSEE at 
time $t$ (it is the mean of the climate indicator at month 3 of quarter $t-1$ and at months 1 
and 2 of quarter $t$). The observation period is 1988-Q1 (1st quarter of 1988) to 2011-Q3. 

We define $X_t = (\Delta\mathrm{GDP}_t, I_t)' \in \mathbb{R}^2$. As we are not interested in the prediction of $I_t$ but 
only in the prediction of the GDP growth, the loss function will only take into account 
$\Delta\mathrm{GDP}_t$. We use the quantile loss function of Example 2.4: 

$$\ell_\tau\big((\Delta\mathrm{GDP}_t, I_t), (\Delta'\mathrm{GDP}_t, I_t')\big) = \begin{cases} \tau\,(\Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t) & \text{if } \Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t > 0, \\ -(1 - \tau)\,(\Delta\mathrm{GDP}_t - \Delta'\mathrm{GDP}_t) & \text{otherwise.} \end{cases}$$

In order to make clear which value of $\tau$ we are dealing with, we now add the index $\tau$ to 
the notation of the prediction risk: 

$$R^\tau(\theta) := \mathbb{E}\left[\ell_\tau\big(\Delta\mathrm{GDP}_t, f_\theta(X_{t-1}, X_{t-2})\big)\right].$$

We also let $r_n^\tau$ denote the associated empirical risk. 

Following Cornec (2010); Li (2010) we consider predictors of the form: 

$$f_\theta(X_{t-1}, X_{t-2}) = \theta_0 + \theta_1 \Delta\mathrm{GDP}_{t-1} + \theta_2 I_{t-1} + \theta_3 (I_{t-1} - I_{t-2})|I_{t-1} - I_{t-2}| \qquad (1)$$

where $\theta = (\theta_0, \theta_1, \theta_2, \theta_3) \in \Theta(B)$. For any $B > 0$ we define 

$$\Theta(B) = \left\{ \theta = (\theta_0, \theta_1, \theta_2, \theta_3) \in \mathbb{R}^4, \quad \|\theta\|_1 = \sum_{i=0}^{3} |\theta_i| \leq B \right\}.$$

The predictors of Equation (1) correspond to the model used in Cornec (2010) for forecasting; 
one of the conclusions of Cornec (2010); Li (2010) is that this family of predictors 
allows one to obtain forecasts as precise as the INSEE ones. 

For technical reasons that will become clear in the proofs, if one wants to achieve a 
prediction performance comparable to the best $\theta \in \Theta(B)$, it is more convenient to define the 
prior $\pi$ as the uniform probability distribution on some slightly larger set, e.g. $\Theta(B + 1)$. We 
will let $\pi_B$ denote this distribution. We let $\hat{\rho}^\tau_{B,\lambda}$ and $\hat{\theta}^\tau_{B,\lambda}$ denote respectively the associated 
aggregation distribution and the associated estimator, defined in Definition 3.2. 

Remark that in this framework, Assumption Lip is satisfied with $L = B + 1$, and the 
loss function is $K$-Lipschitz with $K = 1$, so Assumption LipLoss is also satisfied. 

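Theorem 5.1 Let us fix $\tau \in (0, 1)$. Let us assume that Assumption WeakDep is satisfied, 
and that $n \geq \max\left(10, \kappa^2/(3\mathcal{B}^2)\right)$. Let us fix $\lambda = \sqrt{3n}/\kappa$. Then, with probability at least 
$1 - \varepsilon$ we have 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_{\theta\in\Theta(B)} \left\{ R^\tau(\theta) + \frac{2\sqrt{3}\,\kappa}{\sqrt{n}} \left[ 2.25 + \log\left(\frac{(B+1)\mathcal{B}\sqrt{n}}{\kappa}\right) + \frac{\log\left(\frac{1}{\varepsilon}\right)}{3} \right] \right\}.$$

A detailed proof is given in the appendix. 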

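The choice of $\lambda$ proposed in the theorem may be a problem, as in practice we will not 
know $\kappa$. Note that from the proof, it is clear that in any case, for $n$ large enough, when 
$\lambda = \sqrt{n}$ we still have a bound 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_{\theta\in\Theta(B)} \left\{ R^\tau(\theta) + \frac{C(B, \mathcal{B}, \kappa, \varepsilon)}{\sqrt{n}} \right\},$$

where $\mathcal{B}$ is the almost sure bound on the series from Assumption WeakDep. 

However, in practice, we will work in an online setting: at each date $t$ we compute the 
Gibbs estimator based on the observations from 1 to $t$ and use it to predict the GDP and 
its quantiles at time $t + 1$. Let $\hat{\theta}^\tau_{B,\lambda}[t]$ denote this estimator. We propose the following 
empirical approach: we define a set of values $\Lambda = \{2^k, k \in \mathbb{N}\} \cap \{1, \dots, n\}$. At each step $t$, 
we compute $\hat{\theta}^\tau_{B,\lambda}[t]$ for each $\lambda \in \Lambda$ and use for prediction $\hat{\theta}^\tau_{B,\lambda(t)}[t]$ where $\lambda(t)$ is defined by 

$$\lambda(t) = \arg\min_{\lambda\in\Lambda} \sum_{j=3}^{t-1} \ell_\tau\left(\Delta\mathrm{GDP}_j, f_{\hat{\theta}^\tau_{B,\lambda}[j]}(X_{j-1}, X_{j-2})\right),$$

namely, the value that is currently the best for online prediction. This choice leads to good 
numerical results. 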
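
The empirical calibration of $\lambda$ amounts to a "follow the best so far" rule over a grid. A minimal sketch, assuming that the losses incurred by each candidate on the past dates have already been stored in a dictionary, could look as follows. The names `lambda_grid` and `select_lambda` are hypothetical, and we assume here that the grid starts at $2^0 = 1$.

```python
def lambda_grid(n):
    """Powers of two 1, 2, 4, ... that do not exceed n."""
    grid, value = [], 1
    while value <= n:
        grid.append(value)
        value *= 2
    return grid


def select_lambda(past_losses):
    """Candidate with the smallest cumulated past prediction loss.

    past_losses maps each candidate lambda to the list of quantile losses
    of its predictions on the dates observed so far.
    """
    return min(past_losses, key=lambda lam: sum(past_losses[lam]))


if __name__ == "__main__":
    print(lambda_grid(10))  # the candidates not exceeding n = 10
    losses = {1: [0.5, 0.4], 2: [0.1, 0.2], 4: [0.3, 0.3]}  # toy values
    print(select_lambda(losses))
```

At date $t$ the selected value is then used to form the prediction, and the dictionary is updated once the new observation is available.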

In practice, the choice of $B$ has less importance. As soon as $B$ is large enough, the 
estimator does not really depend on $B$; only the theoretical bound does. As a consequence 
we take $B = 100$ in our experiments. 

5.3. Implementation 

We use the importance sampling method to compute $\hat{\theta}^\tau_{B,\lambda}[t]$ (see, e.g., Robert (1996)). We 
draw an iid sample $T_1, \dots, T_N$ of vectors in $\mathbb{R}^4$ from the distribution $\mathcal{N}(\hat{\theta}^\tau, vI)$ where $v > 0$ 
and $\hat{\theta}^\tau$ is simply the $\tau$-quantile regression estimator of $\theta$ in (1), as computed by the "quantile 
regression package" of the R software R Development Core Team (2008). Let $g(\cdot)$ denote 
the density of this distribution. Then, by the law of large numbers, we can approximate 

$$\hat{\theta}^\tau_{B,\lambda}[t] \approx \frac{\sum_{i=1}^{N} T_i \exp\left[-\lambda r_t(T_i)\right] \mathbf{1}_{\Theta(B+1)}(T_i) / g(T_i)}{\sum_{j=1}^{N} \exp\left[-\lambda r_t(T_j)\right] \mathbf{1}_{\Theta(B+1)}(T_j) / g(T_j)}.$$

Remark that this is particularly convenient as we only simulate the sample $T_1, \dots, T_N$ once 
and we can use the previous formula to approximate $\hat{\theta}^\tau_{B,\lambda}[t]$ for several different values of $\tau$. 
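
The importance sampling approximation can be sketched in a few lines of plain Python. This is only an illustration under our own assumptions: the proposal is an isotropic Gaussian centered at a given vector (here passed as `mean`, standing in for the quantile regression estimate), the prior is uniform on an $\ell_1$-ball so that its constant density cancels in the ratio, and all function names are ours.

```python
import math
import random


def gaussian_density(x, mean, v):
    """Density of the N(mean, v * I) distribution at the point x."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return math.exp(-sq / (2.0 * v)) / (2.0 * math.pi * v) ** (d / 2.0)


def draw_sample(mean, v, size, rng):
    """iid draws T_1, ..., T_N from N(mean, v * I)."""
    sd = math.sqrt(v)
    return [[rng.gauss(m, sd) for m in mean] for _ in range(size)]


def importance_sampling_gibbs(draws, mean, v, risk, lam, radius):
    """Self-normalised importance sampling estimate of the Gibbs estimator.

    Draws outside the l1-ball of the given radius get weight zero; at least
    one draw must fall inside the ball for the ratio to be defined.
    """
    weights = []
    for x in draws:
        if sum(abs(c) for c in x) <= radius:
            weights.append(math.exp(-lam * risk(x)) / gaussian_density(x, mean, v))
        else:
            weights.append(0.0)
    total = sum(weights)
    dim = len(mean)
    return [sum(w * x[d] for w, x in zip(weights, draws)) / total for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    center = [0.0, 0.0]
    sample = draw_sample(center, 1.0, 5000, rng)
    toy_risk = lambda th: abs(th[0] - 0.5) + abs(th[1] + 0.5)  # toy risk
    print(importance_sampling_gibbs(sample, center, 1.0, toy_risk, 20.0, 3.0))
```

The same draws can be reused with different risk functions or values of $\lambda$, which is the practical advantage noted above.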



5.4. Results 

The results are shown in Figure 1 for prediction, $\tau = 0.5$, and in Figure 2 for confidence intervals 
of order 50%, i.e. $\tau = 0.25$ and $\tau = 0.75$ (left), and for confidence intervals of order 90%, 
i.e. $\tau = 0.05$ and $\tau = 0.95$ (right). We report only the results for the period 2000-Q1 to 
2011-Q3 (using the period 1988-Q1 to 1999-Q4 for learning). 

Figure 1: French GDP online prediction using the quantile loss function with $\tau = 0.5$ 
(panel title: out-of-sample forecasts). 



Note that we can compare the ability of our predictor $\hat{\theta}^{0.5}_{B,\lambda}$ with the predictor used in Li 
(2010) that relies on a least squares estimation of (1), that we will denote by $\hat{\theta}^*$. Interestingly, 
both are quite similar but $\hat{\theta}^{0.5}_{B,\lambda}$ is a bit more precise. We recall that 

$$\text{mean abs. pred. error} = \frac{1}{n} \sum_{t=1}^{n} \left| \Delta\mathrm{GDP}_t - f_{\hat{\theta}^{0.5}_{B,\lambda(t)}[t]}(X_{t-1}, X_{t-2}) \right|,$$

$$\text{mean quad. pred. error} = \frac{1}{n} \sum_{t=1}^{n} \left[ \Delta\mathrm{GDP}_t - f_{\hat{\theta}^{0.5}_{B,\lambda(t)}[t]}(X_{t-1}, X_{t-2}) \right]^2.$$

Figure 2: French GDP online 50%-confidence intervals (left) and 90%-confidence intervals 
(right). Both panels are out-of-sample forecasts over the period 2000 to 2010. 

Predictor                         Mean absolute prediction error   Mean quadratic prediction error 
$\hat{\theta}^{0.5}_{B,\lambda}$    0.22360                          0.08033 
$\hat{\theta}^*$                    0.24174                          0.08178 



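We also report the frequency of realizations of the GDP falling below the predicted 
$\tau$-quantile for each $\tau$. Note that this quantity should be close to $\tau$. 

Estimator                          Frequency 
$\hat{\theta}^{0.05}_{B,\lambda}$    0.065 
$\hat{\theta}^{0.25}_{B,\lambda}$    0.434 
$\hat{\theta}^{0.5}_{B,\lambda}$     0.608 
$\hat{\theta}^{0.75}_{B,\lambda}$    0.848 
$\hat{\theta}^{0.95}_{B,\lambda}$    0.978 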
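
For readers who wish to reproduce this kind of diagnostic on their own forecasts, a frequency of this type is a simple count. The sketch below is an illustration with made-up numbers, not the data of the paper, and the function name `frequency_below` is ours; it counts the observations that do not exceed the corresponding predicted quantile.

```python
def frequency_below(observations, predicted_quantiles):
    """Fraction of observations that do not exceed the predicted quantile."""
    hits = sum(
        1 for obs, q in zip(observations, predicted_quantiles) if obs <= q
    )
    return hits / len(observations)


if __name__ == "__main__":
    observed = [0.2, 0.5, -0.1, 0.4]        # toy growth rates
    predicted_q = [0.3, 0.3, 0.3, 0.3]      # toy predicted quantiles
    print(frequency_below(observed, predicted_q))
```

For a well-calibrated predictor of the quantile of order $\tau$, this fraction should be close to $\tau$ on a long enough evaluation period.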



It can be seen that our method behaves quite well in practice. As the INSEE did, we 
miss the value of the 2008 crisis. However, it is interesting to note that our confidence 
interval shows that our prediction at this date is less reliable than the previous ones: so, at 
this time, the forecaster could have been aware of some problems in their predictions. 

6. Conclusion 

We proposed some theoretical results to extend learning theory to the context of weakly 
dependent time series. The method showed good results on an application to GDP fore- 
casting. It would also be interesting to give theoretical results on the online risk of our 
method, e.g. following tools in Catoni (2004); Gerchinovitz (2011). From both a theoretical 
and a practical perspective, an adaptation with respect to the dependence coefficient $\theta_{\infty,n}(1)$ 
would also be really interesting but is probably a more difficult objective. 

Appendix A. Proofs 

A.1. Some preliminary lemmas 

First, we recall Rio's Hoeffding-type inequality. 

Lemma 1 (Rio (2000b)) Let $h$ be a function $(\mathbb{R}^p)^n \to \mathbb{R}$ such that 

$$\forall (x_1, \dots, x_n, y_1, \dots, y_n) \in (\mathbb{R}^p)^{2n}, \quad |h(x_1, \dots, x_n) - h(y_1, \dots, y_n)| \leq \sum_{i=1}^{n} \|x_i - y_i\|. \qquad (2)$$

Then for any $t > 0$ we have 

$$\mathbb{E}\left( e^{t\{\mathbb{E}[h(X_1, \dots, X_n)] - h(X_1, \dots, X_n)\}} \right) \leq e^{\frac{t^2 n (\mathcal{B} + \theta_{\infty,n}(1))^2}{2}}.$$

Note that other Hoeffding- and Bernstein-type inequalities could be used to obtain 
PAC-Bayesian bounds in the context of time series. The monographs Doukhan (1994); Rio (2000a) 
provide a nice review of the results available for mixing time series. Note however that weak 
dependence assumptions are usually more general; some inequalities are provided in Dedecker 
et al. (2007), and a nice review and new results are given in Wintenberger (2010). See also the 
martingale approach in Seldin et al. (2011). However, Lemma 1 is particularly convenient 
in this setting, and leads to particularly general hypotheses. 
Using Lemma 1, we can prove the following lemma. 

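Lemma 2 Let us assume that Assumptions LipLoss, WeakDep and Lip are satisfied. 
For any $\lambda > 0$, for any $\theta \in \Theta$, we have 

$$\mathbb{E}\left( e^{\lambda(R(\theta) - r_n(\theta))} \right) \leq e^{\frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2}} \quad \text{and} \quad \mathbb{E}\left( e^{\lambda(r_n(\theta) - R(\theta))} \right) \leq e^{\frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2}},$$

where we recall that $\kappa = K(1 + L)(\mathcal{B} + C)/\sqrt{2}$. 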

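Proof Let us fix $\lambda > 0$ and $\theta \in \Theta$. Let us define the function $h$ by: 

$$h(x_1, \dots, x_n) = \frac{1}{K(1 + L)} \sum_{i=k+1}^{n} \ell\big(f_\theta(x_{i-1}, \dots, x_{i-k}), x_i\big).$$

We now check that $h$ satisfies (2): 

$$\begin{aligned} |h(x_1, \dots, x_n) - h(y_1, \dots, y_n)| &\leq \frac{1}{K(1+L)} \sum_{i=k+1}^{n} \big| \ell(f_\theta(x_{i-1}, \dots, x_{i-k}), x_i) - \ell(f_\theta(y_{i-1}, \dots, y_{i-k}), y_i) \big| \\ &= \frac{1}{K(1+L)} \sum_{i=k+1}^{n} \big| g(f_\theta(x_{i-1}, \dots, x_{i-k}) - x_i) - g(f_\theta(y_{i-1}, \dots, y_{i-k}) - y_i) \big| \\ &\leq \frac{1}{1+L} \sum_{i=k+1}^{n} \big\| (f_\theta(x_{i-1}, \dots, x_{i-k}) - x_i) - (f_\theta(y_{i-1}, \dots, y_{i-k}) - y_i) \big\| \end{aligned}$$

where we used Assumption LipLoss for the last inequality. So we have 

$$\begin{aligned} |h(x_1, \dots, x_n) - h(y_1, \dots, y_n)| &\leq \frac{1}{1+L} \sum_{i=k+1}^{n} \Big( \big\| f_\theta(x_{i-1}, \dots, x_{i-k}) - f_\theta(y_{i-1}, \dots, y_{i-k}) \big\| + \|x_i - y_i\| \Big) \\ &\leq \frac{1}{1+L} \sum_{i=k+1}^{n} \left( \sum_{j=1}^{k} a_j(\theta) \|x_{i-j} - y_{i-j}\| + \|x_i - y_i\| \right) \\ &\leq \frac{1}{1+L} \sum_{i=1}^{n} \left( 1 + \sum_{j=1}^{k} a_j(\theta) \right) \|x_i - y_i\| \\ &\leq \sum_{i=1}^{n} \|x_i - y_i\| \end{aligned}$$

where we used Assumption Lip. So we can apply Lemma 1. Note that 
$h(X_1, \dots, X_n) = \frac{n-k}{K(1+L)} r_n(\theta)$ and $\mathbb{E}(h(X_1, \dots, X_n)) = \frac{n-k}{K(1+L)} R(\theta)$; choosing 
$t = K(1 + L)\lambda/(n - k)$, we obtain: 

$$\mathbb{E}\left( e^{\lambda[R(\theta) - r_n(\theta)]} \right) \leq e^{\frac{\lambda^2 K^2 (1+L)^2 (\mathcal{B} + \theta_{\infty,n}(1))^2}{2n\left(1-\frac{k}{n}\right)^2}} \leq e^{\frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2}}$$

because of Assumption WeakDep. This ends the proof of the first inequality. The reverse 
inequality is obtained by replacing the function $h$ by $-h$. ■ 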

We also recall the following classical result concerning the Kullback divergence function. 

Lemma 3 (Legendre transform of the Kullback divergence function) For any 
$\pi \in \mathcal{M}_+^1(E)$, for any measurable function $h : E \to \mathbb{R}$ such that $\pi[\exp(h)] < +\infty$ we have: 

$$\pi[\exp(h)] = \exp\left( \sup_{\rho \in \mathcal{M}_+^1(E)} \big( \rho[h] - \mathcal{K}(\rho, \pi) \big) \right), \qquad (3)$$

with the convention $\infty - \infty = -\infty$. Moreover, as soon as $h$ is upper-bounded on the support of 
$\pi$, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure 
$\pi\{h\}$ defined by 

$$\pi\{h\}(dx) = \frac{e^{h(x)} \pi(dx)}{\pi[\exp(h)]}.$$
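
The identity of Lemma 3 is easy to check numerically when $E$ is a finite set, in which case measures are just lists of weights. The following sketch, with toy numbers and function names of our own choosing, verifies that the Gibbs measure attains the supremum and that another measure (here the prior itself) gives a smaller value.

```python
import math


def log_partition(prior, h):
    """log pi[exp(h)] for a prior on a finite set."""
    return math.log(sum(p * math.exp(v) for p, v in zip(prior, h)))


def kullback(rho, prior):
    """Kullback divergence K(rho, pi) on a finite set (0 log 0 = 0)."""
    return sum(r * math.log(r / p) for r, p in zip(rho, prior) if r > 0)


def objective(rho, prior, h):
    """rho[h] - K(rho, pi), the quantity maximised in Lemma 3."""
    return sum(r * v for r, v in zip(rho, h)) - kullback(rho, prior)


def gibbs_measure(prior, h):
    """pi{h}: the prior reweighted by exp(h) and normalised."""
    z = sum(p * math.exp(v) for p, v in zip(prior, h))
    return [p * math.exp(v) / z for p, v in zip(prior, h)]


if __name__ == "__main__":
    prior = [0.2, 0.3, 0.5]
    h = [1.0, -0.5, 0.3]
    gibbs = gibbs_measure(prior, h)
    # The first two printed numbers coincide up to rounding; the third,
    # obtained with rho = pi, is not larger.
    print(log_partition(prior, h))
    print(objective(gibbs, prior, h))
    print(objective(prior, prior, h))
```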



Actually, it seems that in the case of discrete probabilities, this result was already known 
by Kullback (Problem 8.28 of Chapter 2 in Kullback (1959)). For a complete proof in the 
general case, we refer the reader for example to Catoni (2003, 2007). We are now ready to 
state the following key result. 
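
Lemma 4 Let us assume that Assumptions LipLoss, WeakDep and Lip are satisfied. 
Let us fix $\lambda > 0$. Let $\kappa$ be defined as in Lemma 2. Then, 

$$\mathbb{P}\left\{ \begin{array}{l} \forall \rho \in \mathcal{M}_+^1(\Theta), \quad \int R\, d\rho \leq \int r_n\, d\rho + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{\mathcal{K}(\rho,\pi) + \log\left(\frac{2}{\varepsilon}\right)}{\lambda} \\ \text{and} \\ \int r_n\, d\rho \leq \int R\, d\rho + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{\mathcal{K}(\rho,\pi) + \log\left(\frac{2}{\varepsilon}\right)}{\lambda} \end{array} \right\} \geq 1 - \varepsilon.$$

Proof Let us fix $\theta \in \Theta$ and $\lambda > 0$, and apply the first inequality of Lemma 2. We have: 

$$\mathbb{E}\left( e^{\lambda[R(\theta) - r_n(\theta)] - \frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2}} \right) \leq 1,$$

and we multiply this result by $\varepsilon/2$ and integrate it with respect to $\pi(d\theta)$. Fubini's Theorem 
gives: 

$$\mathbb{E}\left( \int e^{\lambda[R(\theta) - r_n(\theta)] - \frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2} - \log\left(\frac{2}{\varepsilon}\right)} \pi(d\theta) \right) \leq \frac{\varepsilon}{2}.$$

We apply Lemma 3 and we get: 

$$\mathbb{E}\left( e^{\sup_\rho \left\{ \lambda \int [R(\theta) - r_n(\theta)] \rho(d\theta) - \frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2} - \log\left(\frac{2}{\varepsilon}\right) - \mathcal{K}(\rho,\pi) \right\}} \right) \leq \frac{\varepsilon}{2}.$$

As $e^x \geq \mathbf{1}_{\mathbb{R}_+}(x)$, we have: 

$$\mathbb{P}\left\{ \sup_\rho \left\{ \lambda \int [R(\theta) - r_n(\theta)] \rho(d\theta) - \frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2} - \log\left(\frac{2}{\varepsilon}\right) - \mathcal{K}(\rho,\pi) \right\} \geq 0 \right\} \leq \frac{\varepsilon}{2}.$$

Now, we follow the same proof again but starting with the second inequality of Lemma 2. 
We obtain: 

$$\mathbb{P}\left\{ \sup_\rho \left\{ \lambda \int [r_n(\theta) - R(\theta)] \rho(d\theta) - \frac{\lambda^2\kappa^2}{n\left(1-\frac{k}{n}\right)^2} - \log\left(\frac{2}{\varepsilon}\right) - \mathcal{K}(\rho,\pi) \right\} \geq 0 \right\} \leq \frac{\varepsilon}{2}.$$

A union bound ends the proof. ■ 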

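A.2. Proof of Theorems 4.1 and 5.1 

Proof [Proof of Theorem 4.1] Recall that LipLoss, WeakDep and Lip are satisfied. We 
apply Lemma 4. We obtain that with probability at least $1 - \varepsilon$, we 
are on the event given, for any $\rho$, by 

$$\int R\, d\rho \leq \int r_n\, d\rho + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{\mathcal{K}(\rho,\pi) + \log\left(\frac{2}{\varepsilon}\right)}{\lambda} \quad \text{and} \quad \int r_n\, d\rho \leq \int R\, d\rho + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{\mathcal{K}(\rho,\pi) + \log\left(\frac{2}{\varepsilon}\right)}{\lambda}. \qquad (4)$$

By the proof of Lemma 4, each of the two inequalities of (4) holds with probability at least 
$1 - \varepsilon/2$. We apply the first inequality of (4) to $\hat{\rho}_\lambda(d\theta)$. We obtain: 

$$\mathbb{P}\left\{ \int R(\theta)\hat{\rho}_\lambda(d\theta) \leq \int r_n(\theta)\hat{\rho}_\lambda(d\theta) + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{1}{\lambda}\log\left(\frac{2}{\varepsilon}\right) + \frac{1}{\lambda}\mathcal{K}(\hat{\rho}_\lambda, \pi) \right\} \geq 1 - \frac{\varepsilon}{2}.$$

According to Lemma 3 we have: 

$$\int r_n(\theta)\hat{\rho}_\lambda(d\theta) + \frac{1}{\lambda}\mathcal{K}(\hat{\rho}_\lambda, \pi) = \inf_\rho \left( \int r_n(\theta)\rho(d\theta) + \frac{1}{\lambda}\mathcal{K}(\rho, \pi) \right)$$

so we obtain 

$$\mathbb{P}\left\{ \int R(\theta)\hat{\rho}_\lambda(d\theta) \leq \inf_\rho \left[ \int r_n(\theta)\rho(d\theta) + \frac{\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{\mathcal{K}(\rho,\pi) + \log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right] \right\} \geq 1 - \frac{\varepsilon}{2}. \qquad (5)$$

We now want to bound from above $r_n(\theta)$ by $R(\theta)$. Applying the second inequality of (4) 
and plugging it into Inequality (5) gives, with probability at least $1 - \varepsilon$, 

$$\int R(\theta)\hat{\rho}_\lambda(d\theta) \leq \inf_\rho \left[ \int R\, d\rho + \frac{2}{\lambda}\mathcal{K}(\rho,\pi) + \frac{2\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{2}{\lambda}\log\left(\frac{2}{\varepsilon}\right) \right].$$

We end the proof by the remark that $\theta \mapsto R(\theta)$ is convex and so 

$$\int R(\theta)\hat{\rho}_\lambda(d\theta) \geq R\left( \int \theta\, \hat{\rho}_\lambda(d\theta) \right) = R(\hat{\theta}_\lambda). \quad ■$$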



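Proof [Proof of Theorem 5.1] We can apply Theorem 4.1 with $R = R^\tau$. We have, with 
probability at least $1 - \varepsilon$, 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_\rho \left[ \int R^\tau d\rho + \frac{2\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{2\mathcal{K}(\rho,\pi_B) + 2\log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right].$$

Now, let us fix $\delta \in (0, 1]$ and $\theta \in \Theta(B)$. We define the probability distribution $\rho_{\theta,\delta}$ as the 
uniform probability measure on the set: 

$$\{T \in \mathbb{R}^4, \quad \|\theta - T\|_1 \leq \delta\}.$$

Note that $\rho_{\theta,\delta} \ll \pi_B$ as $\pi_B$ is defined as uniform on $\Theta(B + 1) \supset \Theta(B + \delta)$. Then: 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_{\theta\in\Theta(B)} \inf_{\delta > 0} \left[ \int R^\tau d\rho_{\theta,\delta} + \frac{2\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{2\mathcal{K}(\rho_{\theta,\delta},\pi_B) + 2\log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right]. \qquad (6)$$

Now, we have to compute or to upper-bound all the terms in the right-hand side of this 
inequality. First, note that: 

$$\int R^\tau d\rho_{\theta,\delta} = \int_{\{\|\theta - T\|_1 \leq \delta\}} R^\tau(T)\, d\rho_{\theta,\delta}(T) \leq R^\tau(\theta) + 2\mathcal{B}\delta \max(\tau, 1 - \tau) \leq R^\tau(\theta) + 2\mathcal{B}\delta. \qquad (7)$$

Then, let us remark that: 

$$\mathcal{K}(\rho_{\theta,\delta}, \pi_B) = 3\log\left(\frac{B + 1}{\delta}\right). \qquad (8)$$

We plug (7) and (8) into (6) to obtain: 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_\theta \inf_\delta \left[ R^\tau(\theta) + 2\mathcal{B}\delta + \frac{2\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{6\log\left(\frac{B+1}{\delta}\right) + 2\log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right].$$

It can easily be seen that the minimum of the right-hand side w.r.t. $\delta$ is reached for 
$\delta = 3/(\mathcal{B}\lambda)$ (we will have to be careful with the choice of $\lambda$ to ensure that $\delta \leq 1$), and so: 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_\theta \left\{ R^\tau(\theta) + \frac{2\lambda\kappa^2}{n\left(1-\frac{k}{n}\right)^2} + \frac{6\log\left(\frac{(B+1)\mathcal{B}\lambda e}{3}\right) + 2\log\left(\frac{2}{\varepsilon}\right)}{\lambda} \right\}.$$

We finally minimize the r.h.s. (roughly) with respect to $\lambda$ to propose $\lambda = \sqrt{3n}/\kappa$; this 
leads to: 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_\theta \left\{ R^\tau(\theta) + \frac{2\sqrt{3}\,\kappa}{\sqrt{n}} \left[ \frac{1}{\left(1-\frac{k}{n}\right)^2} + \log\left(\frac{(B+1)\mathcal{B}e\sqrt{n}}{\sqrt{3}\,\kappa}\right) + \frac{1}{3}\log\left(\frac{2}{\varepsilon}\right) \right] \right\}.$$

Remark that the condition $\delta \leq 1$ is satisfied as soon as $n \geq \kappa^2/(3\mathcal{B}^2)$. Also, when $n \geq 10$ 
we have (recall that $k = 2$ here): 

$$\frac{1}{\left(1-\frac{k}{n}\right)^2} \leq \frac{25}{16}$$

and we can re-organize the terms to obtain: 

$$R^\tau(\hat{\theta}^\tau_{B,\lambda}) \leq \inf_\theta \left\{ R^\tau(\theta) + \frac{2\sqrt{3}\,\kappa}{\sqrt{n}} \left[ 2.25 + \log\left(\frac{(B+1)\mathcal{B}\sqrt{n}}{\kappa}\right) + \frac{\log\left(\frac{1}{\varepsilon}\right)}{3} \right] \right\}. \quad ■$$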